World Health Organization

Context


Many past studies have examined the factors affecting life expectancy using demographic variables, income composition, and mortality rates. However, the effects of immunization and the human development index were not taken into account. In addition, some past research relied on multiple linear regression fitted to a single year of data for all countries. This motivates addressing both gaps by formulating a regression model, based on a mixed effects model and multiple linear regression, using data from 2000 to 2015 for all countries. Important immunizations such as Hepatitis B, Polio, and Diphtheria are also considered. In a nutshell, this study focuses on immunization factors, mortality factors, economic factors, social factors, and other health-related factors. Since the observations in this dataset are organized by country, it will be easier for a country to determine which predicting factor is contributing to a lower life expectancy. This will help suggest which area a country should prioritize in order to efficiently improve the life expectancy of its population.

1. Do the predicting factors chosen initially really affect life expectancy? Which predicting variables actually affect it?

2. Should a country with a lower life expectancy (<65) increase its healthcare expenditure in order to improve its average lifespan?

3. How do infant and adult mortality rates affect life expectancy?

4. Does life expectancy have a positive or negative correlation with eating habits, lifestyle, exercise, smoking, drinking alcohol, etc.?

5. What is the impact of schooling on the lifespan of humans?

6. Does life expectancy have a positive or negative relationship with drinking alcohol?

7. Do densely populated countries tend to have lower life expectancy?

8. What is the impact of immunization coverage on life expectancy?

Objective 1


Display the ability to build regression models using the skills and discussions from Units 1 and 2, with the purpose of identifying key relationships, interpreting those relationships, and making good predictions.

Reminder: the key here is to tell a good story.

Build Model 1

  • Identify key relationships
  • Ensure interpretability
  1. Perform regression analysis

  2. Report predictive ability
    1. Test/train set
    2. CV data
  3. Hypothesis Testing

  4. Interpret the coefficients

  5. Confidence intervals

  6. Practical and statistical significance
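
The Model 1 steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame `sim` is simulated (the column names mirror the dataset, but the values and the true coefficients are invented), and the 70/30 split and the two predictors are arbitrary choices, not the ones used in this report.

```r
# Synthetic stand-in data: names mirror the dataset, values are simulated.
set.seed(1)
n <- 200
sim <- data.frame(
  Adult.Mortality   = runif(n, 1, 500),
  Total.expenditure = runif(n, 1, 17)
)
sim$Life.expectancy <- 80 - 0.05 * sim$Adult.Mortality +
  0.4 * sim$Total.expenditure + rnorm(n, sd = 3)

# Hold out 60 of the 200 rows (30%) as a test set.
test_idx <- sample(n, size = 60)
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

# Interpretable model: main effects only.
fit1 <- lm(Life.expectancy ~ Adult.Mortality + Total.expenditure, data = train)

coef_table <- summary(fit1)$coefficients  # estimates, std. errors, t tests
ci <- confint(fit1, level = 0.95)         # 95% confidence intervals

# Test ASE: mean squared prediction error on the held-out rows.
test_ase <- mean((test$Life.expectancy - predict(fit1, newdata = test))^2)

print(coef_table)
print(ci)
print(test_ase)
```

The coefficient table covers the hypothesis tests, `confint()` gives the intervals needed for interpretation, and `test_ase` is the number to compare against later models.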

Model 2

- Produce the best predictions possible
- Interpretation is no longer required, hence complexity is no longer an issue
  1. Feature selection to avoid overfitting

  2. Create the model

  3. Compare model 1 vs. model 2

  4. Comment on the differences of the models and whether model 2 brings any benefit
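
One way to compare model 1 against a more complex model 2 is k-fold cross-validated ASE. The sketch below is an assumption-laden illustration: the data are simulated, `cv_ase` is a hypothetical helper written for this example (it expects the response column to be named `y`), and the quadratic and interaction terms are just one possible way to add complexity.

```r
# Synthetic data with a curved relationship in x1 (values are simulated).
set.seed(2)
n <- 200
sim <- data.frame(x1 = runif(n), x2 = runif(n))
sim$y <- 60 + 40 * sim$x1^2 + 5 * sim$x2 + rnorm(n, sd = 2)

# Hypothetical helper: k-fold cross-validated ASE for an lm formula.
# Assumes the response column is called `y`.
cv_ase <- function(formula, data, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_ase <- sapply(seq_len(k), function(i) {
    fit  <- lm(formula, data = data[folds != i, ])
    held <- data[folds == i, ]
    mean((held$y - predict(fit, newdata = held))^2)
  })
  mean(fold_ase)
}

ase_model1 <- cv_ase(y ~ x1 + x2, sim)                   # simple, interpretable
ase_model2 <- cv_ase(y ~ poly(x1, 2) + x2 + x1:x2, sim)  # added complexity

print(c(model1 = ase_model1, model2 = ase_model2))
```

If model 2 does not lower the cross-validated ASE by a meaningful amount, the extra complexity brings little benefit over the interpretable model.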

Objective 2


- Nonparametric technique
- kNN or regression trees (select one)

Set of predictors from previous regression: (fill this out)

  1. Model

  2. A brief description of your nonparametric model’s strategy to make a prediction. Include Pros and Cons.

  3. Provide any additional details that you feel might be necessary to report.

  4. Report the test ASE using this nonparametric model so we can see how well it does compared to regression.
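
If kNN is the nonparametric technique selected, its prediction strategy can be sketched in base R as below: standardize the predictors, find the k nearest training rows by Euclidean distance, and average their responses. The function `knn_predict`, the simulated data, and the choice `k = 7` are all assumptions made for illustration; a package implementation may be preferable in practice.

```r
# Hypothetical helper: kNN regression with Euclidean distance.
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  # Standardize with training means/SDs so no predictor dominates the distance.
  mu <- colMeans(train_x)
  s  <- apply(train_x, 2, sd)
  tr <- scale(train_x, center = mu, scale = s)
  nw <- scale(new_x, center = mu, scale = s)
  apply(nw, 1, function(row) {
    d <- sqrt(rowSums(sweep(tr, 2, row)^2))  # distance to every training row
    mean(train_y[order(d)[seq_len(k)]])      # average the k nearest responses
  })
}

# Synthetic stand-in data (names mirror the dataset, values are simulated).
set.seed(3)
n <- 300
x <- cbind(Adult.Mortality = runif(n, 1, 500), Schooling = runif(n, 5, 20))
y <- 75 - 0.04 * x[, "Adult.Mortality"] + 0.5 * x[, "Schooling"] +
  rnorm(n, sd = 2)

test_idx <- sample(n, 90)
pred <- knn_predict(x[-test_idx, ], y[-test_idx], x[test_idx, ], k = 7)

# Test ASE, directly comparable to the regression models' test ASE.
knn_test_ase <- mean((y[test_idx] - pred)^2)
print(knn_test_ase)
```

Standardizing matters here because the predictors are on very different scales; without it, the variable with the largest range would decide which neighbors count as "near".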

EDA

Suchi’s EDA


##  Life.expectancy.1      count             mean             sd       
##  Length:2           Min.   : 41.00   Min.   :5.454   Min.   :2.422  
##  Class :character   1st Qu.: 66.25   1st Qu.:5.693   1st Qu.:2.516  
##  Mode  :character   Median : 91.50   Median :5.933   Median :2.610  
##                     Mean   : 91.50   Mean   :5.933   Mean   :2.610  
##                     3rd Qu.:116.75   3rd Qu.:6.173   3rd Qu.:2.705  
##                     Max.   :142.00   Max.   :6.413   Max.   :2.799
##  Life.expectancy.1      count             mean               sd        
##  Length:2           Min.   : 41.00   Min.   :  95.05   Min.   : 169.8  
##  Class :character   1st Qu.: 66.25   1st Qu.: 387.22   1st Qu.: 838.8  
##  Mode  :character   Median : 91.50   Median : 679.40   Median :1507.8  
##                     Mean   : 91.50   Mean   : 679.40   Mean   :1507.8  
##                     3rd Qu.:116.75   3rd Qu.: 971.58   3rd Qu.:2176.8  
##                     Max.   :142.00   Max.   :1263.75   Max.   :2845.8

## 
##  F test to compare two variances
## 
## data:  Total.expenditure by Life.expectancy.1
## F = 1.3359, num df = 140, denom df = 39, p-value = 0.2947
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.7751741 2.1398108
## sample estimates:
## ratio of variances 
##           1.335916
## 
##  F test to compare two variances
## 
## data:  percentage.expenditure by Life.expectancy.1
## F = 280.92, num df = 141, denom df = 40, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  164.1030 448.0117
## sample estimates:
## ratio of variances 
##             280.92
## 
##  Two Sample t-test
## 
## data:  Total.expenditure by Life.expectancy.1
## t = 1.9683, df = 179, p-value = 0.05058
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.002459138  1.921558429
## sample estimates:
## mean in group High  mean in group Low 
##            6.41305            5.45350
## 
##   Welch's Heteroscedastic F Test (alpha = 0.05) 
## ------------------------------------------------------------- 
##   data : percentage.expenditure and Life.expectancy.1 
## 
##   statistic  : 17.24616 
##   num df     : 1 
##   denom df   : 100.4256 
##   p.value    : 6.901585e-05 
## 
##   Result     : Difference is statistically significant. 
## -------------------------------------------------------------
##  Life.expectancy.1      count            mean             sd       
##  Length:2           Min.   :33.00   Min.   :19.60   Min.   :28.41  
##  Class :character   1st Qu.:38.25   1st Qu.:23.38   1st Qu.:29.28  
##  Mode  :character   Median :43.50   Median :27.16   Median :30.14  
##                     Mean   :43.50   Mean   :27.16   Mean   :30.14  
##                     3rd Qu.:48.75   3rd Qu.:30.94   3rd Qu.:31.01  
##                     Max.   :54.00   Max.   :34.72   Max.   :31.88

## 
##  F test to compare two variances
## 
## data:  Total.expenditure by Life.expectancy.1
## F = 1.6541, num df = 52, denom df = 31, p-value = 0.1357
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.8508704 3.0524391
## sample estimates:
## ratio of variances 
##           1.654055
## 
##  Two Sample t-test
## 
## data:  percentage.expenditure by Life.expectancy.1
## t = -2.2986, df = 85, p-value = 0.02398
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -28.191389  -2.040987
## sample estimates:
## mean in group High  mean in group Low 
##           19.60255           34.71874
## 
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality, data = Life_Expectancy_Df_2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.9654  -2.5457   0.8639   3.2843  13.1335 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     80.64428    0.71342  113.04   <2e-16 ***
## Adult.Mortality -0.06125    0.00391  -15.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.593 on 181 degrees of freedom
## Multiple R-squared:  0.5755, Adjusted R-squared:  0.5732 
## F-statistic: 245.4 on 1 and 181 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = Life.expectancy ~ BMI, data = Life_Expectancy_Df_2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.2151  -4.5711   0.3012   4.2668  23.9674 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 63.69240    1.21867   52.26  < 2e-16 ***
## BMI          0.19423    0.02643    7.35 6.81e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.484 on 179 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.2318, Adjusted R-squared:  0.2275 
## F-statistic: 54.02 on 1 and 179 DF,  p-value: 6.809e-12

## 
## Call:
## lm(formula = Life.expectancy ~ Alcohol, data = Life_Expectancy_Df_2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.3479  -4.4217   0.6886   5.5232  15.3815 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  68.1077     0.6866  99.192  < 2e-16 ***
## Alcohol       1.0732     0.1301   8.252 3.21e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.27 on 180 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2745, Adjusted R-squared:  0.2704 
## F-statistic:  68.1 on 1 and 180 DF,  p-value: 3.214e-14
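
The variance and mean comparisons in the output above follow a common pattern: check equality of variances with an F test, then choose a pooled or unequal-variance test of means accordingly. A minimal sketch of that pattern is given below, assuming simulated data and hypothetical column names `group` and `spend`; note that it substitutes base R's Welch t test for the Welch F test shown above.

```r
# Synthetic two-group data (values are simulated, not from the dataset).
set.seed(4)
sim <- data.frame(
  group = rep(c("High", "Low"), times = c(140, 40)),
  spend = c(rnorm(140, mean = 6.4, sd = 2.8), rnorm(40, mean = 5.5, sd = 2.4))
)

# Step 1: F test for equality of the two group variances.
vt <- var.test(spend ~ group, data = sim)

# Step 2: pooled t test if variances look equal, Welch t test otherwise.
equal_var <- vt$p.value >= 0.05
tt <- t.test(spend ~ group, data = sim, var.equal = equal_var)

print(vt)
print(tt)
```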

Jamie’s EDA


Linear correlations:

- Schooling vs Income.composition.of.resources
- thinness..1.19.years vs thinness.5.9.years
- Life expectancy vs schooling
- Life expectancy vs income
- Infant deaths vs under-five deaths

Variables that were correlated with each other were removed.
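
A correlation screen of this kind could look like the sketch below. The data are simulated, and the 0.8 cutoff and the rule of dropping the second variable of each flagged pair are assumptions for illustration; the report does not state which threshold or rule was applied.

```r
# Synthetic data: two strongly related columns and one unrelated column.
set.seed(5)
n <- 150
schooling <- rnorm(n, 13, 3)
sim <- data.frame(
  Schooling = schooling,
  Income.composition.of.resources = 0.05 * schooling + rnorm(n, sd = 0.03),
  Alcohol = rexp(n, 0.3)
)

# Pairwise correlations, ignoring missing values pair by pair.
cor_mat <- cor(sim, use = "pairwise.complete.obs")

# Flag each pair once (upper triangle) when |r| exceeds the assumed cutoff.
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)

# Assumed rule: keep the first variable of each pair, drop the second.
to_drop <- unique(flagged$var2)
reduced <- sim[, !(names(sim) %in% to_drop), drop = FALSE]

print(flagged)
print(names(reduced))
```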

##    Country               Year         Status          Life.expectancy
##  Length:183         Min.   :2014   Length:183         Min.   :48.10  
##  Class :character   1st Qu.:2014   Class :character   1st Qu.:65.60  
##  Mode  :character   Median :2014   Mode  :character   Median :73.60  
##                     Mean   :2014                      Mean   :71.54  
##                     3rd Qu.:2014                      3rd Qu.:76.85  
##                     Max.   :2014                      Max.   :89.00  
##                                                                      
##  Adult.Mortality infant.deaths       Alcohol       percentage.expenditure
##  Min.   :  1.0   Min.   :  0.00   Min.   : 0.010   Min.   :    0.00      
##  1st Qu.: 66.0   1st Qu.:  0.00   1st Qu.: 0.010   1st Qu.:   11.06      
##  Median :135.0   Median :  2.00   Median : 0.320   Median :  151.10      
##  Mean   :148.7   Mean   : 24.56   Mean   : 3.271   Mean   : 1001.91      
##  3rd Qu.:216.5   3rd Qu.: 18.00   3rd Qu.: 6.700   3rd Qu.:  703.21      
##  Max.   :522.0   Max.   :957.00   Max.   :15.190   Max.   :19479.91      
##                                   NA's   :1                              
##   Hepatitis.B       Measles           BMI        under.five.deaths
##  Min.   : 2.00   Min.   :    0   Min.   : 2.00   Min.   :   0.00  
##  1st Qu.:79.00   1st Qu.:    0   1st Qu.:23.20   1st Qu.:   0.00  
##  Median :93.00   Median :   13   Median :47.40   Median :   3.00  
##  Mean   :83.12   Mean   : 1831   Mean   :41.03   Mean   :  32.89  
##  3rd Qu.:97.00   3rd Qu.:  316   3rd Qu.:59.80   3rd Qu.:  22.00  
##  Max.   :99.00   Max.   :79563   Max.   :77.10   Max.   :1200.00  
##  NA's   :10                      NA's   :2                        
##      Polio       Total.expenditure   Diphtheria       HIV.AIDS    
##  Min.   : 8.00   Min.   : 1.210    Min.   : 2.00   Min.   :0.100  
##  1st Qu.:80.00   1st Qu.: 4.480    1st Qu.:83.00   1st Qu.:0.100  
##  Median :94.00   Median : 5.840    Median :94.00   Median :0.100  
##  Mean   :84.73   Mean   : 6.201    Mean   :84.08   Mean   :0.682  
##  3rd Qu.:97.00   3rd Qu.: 7.740    3rd Qu.:97.00   3rd Qu.:0.400  
##  Max.   :99.00   Max.   :17.140    Max.   :99.00   Max.   :9.400  
##                  NA's   :2                                        
##       GDP              Population        thinness..1.19.years
##  Min.   :    12.28   Min.   :4.100e+01   Min.   : 0.100      
##  1st Qu.:   617.99   1st Qu.:2.869e+05   1st Qu.: 1.500      
##  Median :  3154.51   Median :1.568e+06   Median : 3.300      
##  Mean   : 10015.57   Mean   :2.106e+07   Mean   : 4.533      
##  3rd Qu.:  8239.95   3rd Qu.:8.080e+06   3rd Qu.: 6.600      
##  Max.   :119172.74   Max.   :1.294e+09   Max.   :26.800      
##  NA's   :28          NA's   :41          NA's   :2           
##  thinness.5.9.years Income.composition.of.resources   Schooling    
##  Min.   : 0.100     Min.   :0.3450                  Min.   : 4.90  
##  1st Qu.: 1.500     1st Qu.:0.5700                  1st Qu.:10.80  
##  Median : 3.400     Median :0.7220                  Median :13.00  
##  Mean   : 4.676     Mean   :0.6884                  Mean   :12.89  
##  3rd Qu.: 6.600     3rd Qu.:0.7960                  3rd Qu.:14.90  
##  Max.   :27.400     Max.   :0.9450                  Max.   :20.40  
##  NA's   :2          NA's   :10                      NA's   :10     
##  Life.expectancy.1 
##  Length:183        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
##  [1] "Country"                         "Year"                           
##  [3] "Status"                          "Life.expectancy"                
##  [5] "Adult.Mortality"                 "infant.deaths"                  
##  [7] "Alcohol"                         "percentage.expenditure"         
##  [9] "Hepatitis.B"                     "Measles"                        
## [11] "BMI"                             "under.five.deaths"              
## [13] "Polio"                           "Total.expenditure"              
## [15] "Diphtheria"                      "HIV.AIDS"                       
## [17] "GDP"                             "Population"                     
## [19] "thinness..1.19.years"            "thinness.5.9.years"             
## [21] "Income.composition.of.resources" "Schooling"                      
## [23] "Life.expectancy.1"

Regression Testing/Variables Reduction


Forward, backward, and stepwise regressions were run, and all three resulted in the same four significant variables.

Variables:

- Adult.Mortality
- Total.expenditure
- HIV.AIDS
- Income.composition.of.resources
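
For reference, the three selection procedures can be run with base R's `step()` as sketched below. This is an illustration on simulated data with hypothetical predictors `x1` to `x6`; `step()` selects by AIC, which is an assumption here, since the report does not say which criterion or function was used.

```r
# Synthetic data: only x1 and x2 truly affect y (values are simulated).
set.seed(6)
n <- 200
sim <- data.frame(matrix(rnorm(n * 6), n, 6))
names(sim) <- paste0("x", 1:6)
sim$y <- 70 + 3 * sim$x1 - 2 * sim$x2 + rnorm(n)

null_fit <- lm(y ~ 1, data = sim)
full_fit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = sim)
scope <- list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4 + x5 + x6)

# AIC-based selection in each direction; trace = 0 silences the step log.
fwd  <- step(null_fit, scope = scope, direction = "forward",  trace = 0)
bwd  <- step(full_fit, direction = "backward", trace = 0)
both <- step(null_fit, scope = scope, direction = "both",     trace = 0)

# Collect the selected terms from each procedure for side-by-side comparison.
selected <- lapply(
  list(forward = fwd, backward = bwd, stepwise = both),
  function(m) sort(attr(terms(m), "term.labels"))
)
print(selected)
```

Comparing the entries of `selected` shows directly whether the three procedures agree on the retained variables.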

## 
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality + Total.expenditure + 
##     HIV.AIDS + Income.composition.of.resources + Life.expectancy.1, 
##     data = df1_complete)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7539 -1.5929  0.0074  1.6631 10.6939 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     50.76505    2.32027  21.879  < 2e-16 ***
## Adult.Mortality                 -0.01700    0.00378  -4.496 1.56e-05 ***
## Total.expenditure                0.39230    0.10971   3.576 0.000497 ***
## HIV.AIDS                        -0.55729    0.24900  -2.238 0.026985 *  
## Income.composition.of.resources 31.75821    2.96054  10.727  < 2e-16 ***
## Life.expectancy.1Low            -2.90429    1.08518  -2.676 0.008441 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.029 on 125 degrees of freedom
## Multiple R-squared:  0.8809, Adjusted R-squared:  0.8761 
## F-statistic: 184.9 on 5 and 125 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria + 
##     HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years + 
##     Income.composition.of.resources + Schooling + Life.expectancy.1, 
##     data = df1_complete)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6206 -1.7993  0.1196  1.4778  9.0945 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.416e+01  3.441e+00  15.739  < 2e-16 ***
## Adult.Mortality                 -1.707e-02  4.062e-03  -4.201 5.39e-05 ***
## infant.deaths                    5.084e-02  5.664e-02   0.897 0.371396    
## Alcohol                          5.116e-02  9.458e-02   0.541 0.589625    
## percentage.expenditure           3.714e-04  4.545e-04   0.817 0.415602    
## Hepatitis.B                     -8.781e-03  2.818e-02  -0.312 0.755941    
## Measles                         -2.284e-05  4.748e-05  -0.481 0.631396    
## BMI                             -7.922e-03  1.954e-02  -0.406 0.685872    
## under.five.deaths               -3.548e-02  3.892e-02  -0.912 0.363915    
## Polio                           -7.978e-03  2.073e-02  -0.385 0.701115    
## Total.expenditure                3.129e-01  1.243e-01   2.518 0.013221 *  
## Diphtheria                       2.522e-02  3.397e-02   0.742 0.459403    
## HIV.AIDS                        -5.789e-01  2.628e-01  -2.202 0.029702 *  
## GDP                             -2.934e-05  6.516e-05  -0.450 0.653446    
## Population                      -2.321e-09  6.646e-09  -0.349 0.727600    
## thinness..1.19.years             1.896e-02  2.296e-01   0.083 0.934327    
## thinness.5.9.years              -1.528e-01  2.261e-01  -0.676 0.500610    
## Income.composition.of.resources  2.717e+01  7.106e+00   3.824 0.000217 ***
## Schooling                        2.068e-02  2.732e-01   0.076 0.939815    
## Life.expectancy.1Low            -3.215e+00  1.303e+00  -2.468 0.015113 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.12 on 111 degrees of freedom
## Multiple R-squared:  0.8877, Adjusted R-squared:  0.8685 
## F-statistic:  46.2 on 19 and 111 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality + Total.expenditure + 
##     HIV.AIDS + Income.composition.of.resources + Life.expectancy.1, 
##     data = df1_complete)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7539 -1.5929  0.0074  1.6631 10.6939 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     50.76505    2.32027  21.879  < 2e-16 ***
## Adult.Mortality                 -0.01700    0.00378  -4.496 1.56e-05 ***
## Total.expenditure                0.39230    0.10971   3.576 0.000497 ***
## HIV.AIDS                        -0.55729    0.24900  -2.238 0.026985 *  
## Income.composition.of.resources 31.75821    2.96054  10.727  < 2e-16 ***
## Life.expectancy.1Low            -2.90429    1.08518  -2.676 0.008441 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.029 on 125 degrees of freedom
## Multiple R-squared:  0.8809, Adjusted R-squared:  0.8761 
## F-statistic: 184.9 on 5 and 125 DF,  p-value: < 2.2e-16